translate : recipes_source/distributed_device_mesh.rst #1132

Merged
jih0-kim merged 4 commits into PyTorchKorea:master from ehdtjr:translate/distributed_device_mesh on Jun 1, 2026

Conversation

@ehdtjr
Contributor

@ehdtjr ehdtjr commented May 16, 2026

License agreement

You must agree that the BSD 3-Clause License applies to the changes you are contributing.

For more details, please refer to the contributing guide.

If you agree, change the [ ] below to [x].

  • I have read the contributing guide and agree that the BSD 3-Clause License applies to the contents of this PR.

Related issue number

Please list the issue numbers related to this Pull Request.

Prefixing an issue or PR number with # lets you see its title right away. (e.g. #999)

PR type

Change the [ ] in front of the type that applies to this PR to [x].

  • A contribution that fixes typos or improves a translation
  • A contribution that translates an untranslated tutorial
  • A contribution that reflects changes to the official tutorials
  • A contribution not covered by the types above

PR description

Translated the recipes_source/distributed_device_mesh.rst document.

@testofschool
Contributor

Hello Dongseok, overall this looks like a clean translation. LGTM!

@ptesogno
Contributor

On line 177 the title of the reference document has been translated; wouldn't it be better to leave it in the original, as on line 178?
Thanks for all the hard work!

Member

@hyoyoung hyoyoung left a comment

Thank you for the effort of translating such a long document.
Please take a look at a few minor suggestions.

=====================================================

**Author**: `Iris Zhang <https://github.com/wz337>`__, `Wanchao Liang <https://github.com/wanchaol>`__
**저자**: `Iris Zhang <https://github.com/wz337>`__, `Wanchao Liang <https://github.com/wanchaol>`__
Member

Please add a translator (역자) entry below the author.

DeviceMesh is useful when working with multi-dimensional parallelism (i.e. 3-D parallel) where parallelism composability is required. For example, when your parallelism solutions require both communication across hosts and within each host.
The image above shows that we can create a 2D mesh that connects the devices within each host, and connects each device with its counterpart on the other hosts in a homogeneous setup.
DeviceMesh는 여러 병렬화 방식을 조합(composability)해야 하는 다차원 병렬화(예: 3-D 병렬)를 다룰 때 유용합니다. 예를 들어, 병렬화 방식이 호스트 간 통신과 각 호스트 내부의 통신을 모두 요구하는 경우가 그렇습니다.
위 이미지는 균일한 환경에서 각 호스트 내부의 디바이스를 연결하고, 각 디바이스를 다른 호스트의 대응 디바이스와 연결하는 2D 메시를 만들 수 있음을 보여줍니다.
Member

"균일한 환경" ("uniform environment") is fine, but changing it to "동일한 구성의 환경" ("an environment with identical configuration") would read more naturally, I think.

First, we need to manually calculate the shard group and replicate group. Then, we need to assign the correct shard and
replicate group to each rank.
DeviceMesh가 없다면, 어떤 병렬화를 적용하기 전에 각 프로세스마다 NCCL 통신기와 CUDA 디바이스를 직접 설정해야 하며, 이는 꽤 복잡한 작업입니다.
다음 코드는 :class:`DeviceMesh` 없이 하이브리드 샤딩(hybrid sharding) 2-D 병렬 패턴을 설정하는 예시입니다.
Member

How about changing "2-D" to something like "2차원" ("two-dimensional")?
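
For readers following the hunk above, here is a minimal pure-Python sketch of the manual bookkeeping the English source describes: calculating the shard and replicate groups, then assigning each rank its pair. The 2-host, 4-device layout, the host-by-host rank numbering, the choice to shard within a host and replicate across hosts, and the helper names `build_groups` and `groups_for_rank` are all assumptions for illustration, not taken from the recipe. No PyTorch calls are made.

```python
# Hypothetical cluster shape (an assumption): 2 hosts with 4 devices each.
NUM_HOSTS = 2
DEVICES_PER_HOST = 4


def build_groups(num_hosts, devices_per_host):
    """Return (shard_groups, replicate_groups) as lists of rank lists."""
    # Assumed numbering: host h owns ranks
    # h * devices_per_host ... (h + 1) * devices_per_host - 1.
    shard_groups = [
        list(range(h * devices_per_host, (h + 1) * devices_per_host))
        for h in range(num_hosts)
    ]
    # A replicate group links the devices that share a local index across hosts.
    replicate_groups = [
        [h * devices_per_host + d for h in range(num_hosts)]
        for d in range(devices_per_host)
    ]
    return shard_groups, replicate_groups


def groups_for_rank(rank, shard_groups, replicate_groups):
    """Pick the one shard group and the one replicate group containing `rank`."""
    shard = next(g for g in shard_groups if rank in g)
    replicate = next(g for g in replicate_groups if rank in g)
    return shard, replicate


shard_groups, replicate_groups = build_groups(NUM_HOSTS, DEVICES_PER_HOST)
print(shard_groups)      # [[0, 1, 2, 3], [4, 5, 6, 7]]
print(replicate_groups)  # [[0, 4], [1, 5], [2, 6], [3, 7]]
print(groups_for_rank(5, shard_groups, replicate_groups))  # ([4, 5, 6, 7], [1, 5])
```

In a real job each of these rank lists would still have to be turned into a process group on every rank, which is the boilerplate the recipe says a device mesh removes.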

With the help of :func:`init_device_mesh`, we can accomplish the above 2D setup in just two lines, and we can still
access the underlying :class:`ProcessGroup` if needed.
:func:`init_device_mesh` 를 활용하면 위의 2D 설정을 단 두 줄로 끝낼 수 있고, 필요할 때는
내부의 :class:`ProcessGroup` 에도 그대로 접근할 수 있습니다.
Member

I think "그대로" ("as is") can be dropped.

--------------------------------------------------------
When working with large scale training, you might have more complex custom parallel training composition. For example, you may need to slice out sub-meshes for different parallelism solutions.
DeviceMesh allows users to slice child mesh from the parent mesh and re-use the NCCL communicators already created when the parent mesh is initialized.
대규모 학습 환경에서는 더 복잡한 사용자 정의 병렬 학습 구성을 다뤄야 할 수도 있습니다. 예를 들어, 서로 다른 병렬화 방식에 맞춰 하위 메시(sub-mesh)를 잘라내야 할 수 있습니다.
Member

์ž˜๋ผ๋‚ด๋Š”๊ฒŒ ์˜๋ฏธ๋Š” ๋งž๋Š”๋ฐ ์กฐ๊ธˆ ๋” ์˜์—ญํ•ด๋„ ์ข‹์„๊ฑฐ ๊ฐ™์Šต๋‹ˆ๋‹ค
ํ•˜์œ„ ๋ฉ”์‹œ๋ฅผ ๋‚˜๋ˆ„์–ด ์‚ฌ์šฉํ•ด์•ผ ํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค ์ •๋„๋Š” ์–ด๋–จ๊นŒ์š”

@ehdtjr ehdtjr requested a review from hyoyoung May 29, 2026 17:01
Member

@hyoyoung hyoyoung left a comment

LGTM

Member

@jih0-kim jih0-kim left a comment

LGTM

@jih0-kim jih0-kim merged commit afab02f into PyTorchKorea:master Jun 1, 2026